An Efficient Algorithm for Outlier Detection in High Dimensional Real Databases

Authors

  • Carlos H. C. Teixeira
  • Gustavo H. Orair
  • Wagner Meira
  • Srinivasan Parthasarathy
Abstract

Detecting outlier patterns in data has been an important research topic in the statistics, data mining, and machine learning communities for many years. Research into effective solutions to this problem has applications in a myriad of domains, ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Among the different algorithms, statistical (parametric) approaches and distance-based outlier detection are the most widely used. The former are well grounded but often have difficulty scaling to large, high-dimensional data. The latter is relatively efficient and has been found empirically to be effective in a number of domains, but scalability is still an issue despite considerable research on the topic. To address this limitation, we propose Atalaia, an efficient and scalable distance-based algorithm for detecting outliers in large, high-dimensional databases. Central to our algorithm is a fast strategy for estimating the unusualness of a record within the database, combined with a rank-ordered approach to evaluating records. Our algorithm partitions the database and ranks the objects that are candidate outliers, significantly reducing the number of comparisons among objects. We evaluate different ranking heuristics on a comprehensive set of real and synthetic databases. Further, Atalaia also handles heterogeneous databases, i.e., those containing both categorical and continuous attributes. The results show that our algorithm outperforms the state-of-the-art distance-based outlier detection algorithm by up to 73%.
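
The abstract gives no implementation details for Atalaia, so the following is only a rough sketch of the general family it belongs to: distance-based detection of the top-n outliers, scored by distance to the k-th nearest neighbor, with candidates visited in ranked order so that a pruning cutoff rises quickly. The function names, the centroid-based ranking heuristic, and the choice of Euclidean distance are all assumptions made for illustration and are not taken from the paper. Partitioning and categorical attributes are not modeled here.

```python
import heapq
import math


def kth_nn_distance(point, data, k, cutoff=-math.inf, skip_index=None):
    """Distance from `point` to its k-th nearest neighbor, or None if pruned.

    Once k neighbors have been seen and the current k-th smallest distance
    is <= cutoff, the final score cannot exceed the cutoff, so we stop early.
    """
    nearest = []  # max-heap (negated values) holding the k smallest distances
    for j, other in enumerate(data):
        if j == skip_index:
            continue
        d = math.dist(point, other)  # Euclidean distance (assumed metric)
        if len(nearest) < k:
            heapq.heappush(nearest, -d)
        elif d < -nearest[0]:
            heapq.heapreplace(nearest, -d)
        if len(nearest) == k and -nearest[0] <= cutoff:
            return None  # cannot beat the weakest outlier found so far
    return -nearest[0] if len(nearest) == k else math.inf


def top_n_outliers(data, k, n):
    """Return the n (score, index) pairs with the largest k-th NN distance."""
    dims = len(data[0])
    centroid = [sum(p[i] for p in data) / len(data) for i in range(dims)]
    # Hypothetical ranking heuristic (not the paper's): visit points that are
    # far from the centroid first, since they are more likely to be outliers.
    order = sorted(range(len(data)),
                   key=lambda i: math.dist(data[i], centroid),
                   reverse=True)
    top = []  # min-heap of (score, index); top[0] is the weakest outlier kept
    for i in order:
        cutoff = top[0][0] if len(top) == n else -math.inf
        score = kth_nn_distance(data[i], data, k, cutoff, skip_index=i)
        if score is None:
            continue
        if len(top) < n:
            heapq.heappush(top, (score, i))
        else:
            heapq.heapreplace(top, (score, i))
    return sorted(top, reverse=True)
```

In this sketch, processing likely outliers first means the cutoff becomes large early, so most ordinary records are discarded after only a few distance computations. That is the intuition behind ranking candidates, although the heuristics actually evaluated in the paper may differ.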

Similar resources

Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data

Background and purpose: By evolving science, knowledge, and technology, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimension problems, classical methods are not reliable because of a large number of predictor variable...

Detecting High-Dimensional Outliers: the New Task, Algorithms and Performance

Outlier detection is a fundamental step in knowledge discovery in databases. With the increasing number of high-dimensional databases, existing outlier detection algorithms that work only in the context of full space are unable to effectively screen out informative outliers. This is because the majority of these outliers exist only in subspaces. In this paper, we identify a new outlier detection t...

Outlier detection for high dimensional data pdf

Is particularly useful for high dimensional data where outliers cannot be found. High dimensional data in Euclidean space pose special challenges to data. In about just the last few years, the task of unsupervised outlier detection has found. Outlier detection is an outstanding data mining task referred to...

Application of Recursive Least Squares to Efficient Blunder Detection in Linear Models

In many geodetic applications a large number of observations are being measured to estimate the unknown parameters. The unbiasedness property of the estimated parameters is only ensured if there is no bias (e.g. systematic effect) or falsifying observations, which are also known as outliers. One of the most important steps towards obtaining a coherent analysis for the parameter estimation is th...

Outlier Detection for Support Vector Machine using Minimum Covariance Determinant Estimator

The purpose of this paper is to identify the points that affect the performance of one of the important algorithms of data mining, namely the support vector machine. The final classification decision is made based on a small portion of the data called support vectors. So, the existence of atypical observations among the aforementioned points will result in deviation from the correct decision. Thus...

Publication date: 2008